{"query": "How did they obtain the dataset?", "gt": "public resources where suspicious Twitter accounts were annotated, list with another 32 Twitter accounts from BIBREF19 that are considered trustworthy", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 20 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset was obtained by collecting tweets from two lists of annotated accounts, one for non-factual and one for factual accounts, and discarding duplicate and media-based tweets.", "evaluation": "f1"}
{"query": "What activation function do they use in their model?", "gt": "relu, selu, tanh", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 3 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here's the answer to your question:\n\nReLU activation.", "evaluation": "f1"}
{"query": "What baselines do they compare to?", "gt": "Top-$k$ replies, likes, or re-tweets, FacTweet (tweet-level), LR + All Features (chunk-level), LR + All Features (tweet-level), Tweet2vec, LR + Bag-of-words", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 21 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here's the answer to the question with 21 words:\n\nThey compare to the following baselines: LR + Bag-of-words, Tweet2vec, LR + All Features (tweet-level), and LR + All Features (chunk-level).", "evaluation": "f1"}
{"query": "How are chunks defined?", "gt": "Chunks is group of tweets from single account that  is consecutive in time - idea is that this group can show secret intention of malicious accounts.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 26 words.[/INST]", "llama2-13b-chat-4k_pred": "  Chunks are defined as sequences of tweets posted by a single account within a specific time frame, typically ranging from a few minutes to several hours.", "evaluation": "f1"}
{"query": "What features are extracted?", "gt": "Sentiment, Morality, Style, Words embeddings", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 5 words.[/INST]", "llama2-13b-chat-4k_pred": "  Emotions, sentiments, morality, style, and words.", "evaluation": "f1"}
{"query": "What features are extracted?", "gt": "15 emotion types, sentiment classes, positive and negative, care, harm, fairness, cheating, loyalty, betrayal, authority, subversion, sanctity, and degradation, count of question marks, exclamation marks, consecutive characters and letters, links, hashtags, users' mentions, uppercase ratio, tweet length, words embeddings", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 39 words.[/INST]", "llama2-13b-chat-4k_pred": "  The features extracted include emotions, sentiment, morality, style, and words embeddings.", "evaluation": "f1"}
{"query": "Was the approach used in this work to detect fake news fully supervised?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Fully supervised.", "evaluation": "f1"}
{"query": "Based on this paper, what is the more predictive set of features to detect fake news?", "gt": "words embeddings, style, and morality features", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 6 words.[/INST]", "llama2-13b-chat-4k_pred": "  Based on the paper, the most predictive set of features to detect fake news is the combination of word embeddings, style, and morality features.", "evaluation": "f1"}
{"query": "How big is the dataset used in this work?", "gt": "Total dataset size: 171 account (522967 tweets)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 7 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset used in this work contains 180 Twitter accounts.", "evaluation": "f1"}
{"query": "How is a \"chunk of posts\" defined in this work?", "gt": "chunk consists of a sorted sequence of tweets labeled by the label of its corresponding account", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 16 words.[/INST]", "llama2-13b-chat-4k_pred": "  In this work, a \"chunk of posts\" refers to a sequence of tweets from a single Twitter account, sorted by posting date and divided into smaller groups of a fixed size (e.g., 20 tweets).", "evaluation": "f1"}
{"query": "What baselines were used in this work?", "gt": "LR + Bag-of-words, Tweet2vec, LR + All Features (tweet-level), LR + All Features (chunk-level), FacTweet (tweet-level), Top-$k$ replies, likes, or re-tweets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 21 words.[/INST]", "llama2-13b-chat-4k_pred": "  The baselines used in this work are: LR + Bag-of-words, Tweet2vec, LR + All Features (tweet-level), LR + All Features (chunk-level), and Top-$k$ replies, likes, or re-tweets.", "evaluation": "f1"}
{"query": "What models are used in the experiment?", "gt": "linear SVM, bidirectional Long Short-Term-Memory (BiLSTM), Convolutional Neural Network (CNN)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 10 words.[/INST]", "llama2-13b-chat-4k_pred": "  The experiment uses linear SVM, BiLSTM, and CNN models.", "evaluation": "f1"}
{"query": "What are the differences between this dataset and pre-existing ones?", "gt": "no prior work has explored the target of the offensive language", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 11 words.[/INST]", "llama2-13b-chat-4k_pred": "  This dataset includes annotation of type and target of offensive language, while others only have binary offensive/not offensive labels.", "evaluation": "f1"}
{"query": "In what language are the tweets?", "gt": "English", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  English.", "evaluation": "f1"}
{"query": "What is the size of the new dataset?", "gt": "14,100 tweets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  The new dataset size is 14,100 tweets.", "evaluation": "f1"}
{"query": "What is the size of the new dataset?", "gt": "Dataset contains total of 14100 annotations.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 6 words.[/INST]", "llama2-13b-chat-4k_pred": "  The new dataset is 14,100 tweets large.", "evaluation": "f1"}
{"query": "What kinds of offensive content are explored?", "gt": "non-targeted profanity and swearing, targeted insults such as cyberbullying, offensive content related to ethnicity, gender or sexual orientation, political affiliation, religious belief, and anything belonging to hate speech", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 28 words.[/INST]", "llama2-13b-chat-4k_pred": "  The paper explores various types of offensive content, including abusive language, cyberbullying, hate speech, and profanity, with a focus on identifying the targets of offensive language.", "evaluation": "f1"}
{"query": "What is the best performing model?", "gt": "CNN ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Best performing model: CNN.", "evaluation": "f1"}
{"query": "How many annotators participated?", "gt": "five annotators", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Six experts.", "evaluation": "f1"}
{"query": "What is the definition of offensive language?", "gt": " Most prior work focuses on a different aspect of offensive language such as abusive language BIBREF0 , BIBREF1 , (cyber-)aggression BIBREF2 , (cyber-)bullying BIBREF3 , BIBREF4 , toxic comments INLINEFORM0 , hate speech BIBREF5 , BIBREF6 , BIBREF7 , BIBREF8 , BIBREF9 , BIBREF10 , and offensive language BIBREF11 . Prior work has focused on these aspects of offensive language in Twitter BIBREF3 , BIBREF7 , BIBREF8 , BIBREF11 , Wikipedia comments, and Facebook posts BIBREF2 .", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 77 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here's the answer to the question with 77 words:\n\nOffensive language refers to words or phrases that are considered rude, disrespectful, or hurtful to certain individuals or groups. It can include profanity, derogatory terms, slurs, and other language that is intended to degrade or demean others based on their race, gender, sexual orientation, religion, or other personal characteristics. Offensive language can cause harm and perpetuate negative stereotypes, and it is important to avoid using it in order to promote respect and inclusivity.", "evaluation": "f1"}
{"query": "What are the three layers of the annotation scheme?", "gt": "Level A: Offensive language Detection\n, Level B: Categorization of Offensive Language\n, Level C: Offensive Language Target Identification\n", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 19 words.[/INST]", "llama2-13b-chat-4k_pred": "  The three layers of the annotation scheme are: (1) offensive language detection, (2) categorization of offensive language, and (3) offensive language target identification.", "evaluation": "f1"}
{"query": "How long is the dataset for each step of hierarchy?", "gt": "Level A: 14100 Tweets\nLevel B: 4640 Tweets\nLevel C: 4089 Tweets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset size for each step of the hierarchy is as follows:\n\n1. Level A (Offensive language detection): 14,100 tweets.\n2. Level B (Categorization of offensive language): 14,100 tweets.\n3. Level C (Offensive language target identification): 14,100 tweets.\n\nNote that the dataset size is the same for all three levels of the hierarchy, as they are all based on the same dataset of tweets.", "evaluation": "f1"}
{"query": "What is the approach of previous work?", "gt": "'shallow' naive Bayes, SVM, hierarchical stacked classifiers, bidirectional recurrent neural networks", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 11 words.[/INST]", "llama2-13b-chat-4k_pred": "  Previous work has focused on developing shallow naive Bayes classifiers.", "evaluation": "f1"}
{"query": "Is the lexicon the same for all languages?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  No.", "evaluation": "f1"}
{"query": "How do they obtain the lexicon?", "gt": "built over all the data and therefore includes the vocabulary from both the training and testing sets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 17 words.[/INST]", "llama2-13b-chat-4k_pred": "  The lexicon is built over all the data and includes the vocabulary from both the training and testing sets.", "evaluation": "f1"}
{"query": "What evaluation metric is used?", "gt": "average classification accuracy, execution performance", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 5 words.[/INST]", "llama2-13b-chat-4k_pred": "  The evaluation metric used is accuracy.", "evaluation": "f1"}
{"query": "Which languages are similar to each other?", "gt": "Nguni languages (zul, xho, nbl, ssw), Sotho languages (nso, sot, tsn)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 11 words.[/INST]", "llama2-13b-chat-4k_pred": "  The Nguni languages (zul, xho, nbl, ssw) are similar to each other.", "evaluation": "f1"}
{"query": "Which datasets are employed for South African languages LID?", "gt": "DSL 2015, DSL 2017, JW300 parallel corpus , NCHLT text corpora", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 11 words.[/INST]", "llama2-13b-chat-4k_pred": "  The paper employs the NCHLT text corpora and the DSL 2015 & 2017 shared tasks datasets for South African languages LID.", "evaluation": "f1"}
{"query": "Does the paper report the performance of a baseline model on South African languages LID?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "Does the algorithm improve on the state-of-the-art methods?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "Do they report results only on English data?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  No.", "evaluation": "f1"}
{"query": "Do they report results only on English data?", "gt": "The authors showed few tweets where neither and implicit hatred content exist but the model was able to discriminate", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 19 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes, the results are reported only on English data.", "evaluation": "f1"}
{"query": "What evidence do the authors present that the model can capture some biases in data annotation and collection?", "gt": "Waseem-dataset, Davidson-dataset,", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  The authors present evidence that the model can capture biases in data annotation and collection by examining the results of the transfer learning approach and finding that the model can detect some biases in the process of collecting or annotating datasets. Specifically, the authors find that the model can capture some biases in the data collection and annotation process, such as the underrepresentation of certain groups or the overrepresentation of certain types of language. This evidence suggests that the model is able to learn and capture some of the biases present in the data it is trained on, which is an important step towards debiasing hate speech datasets in future studies.", "evaluation": "f1"}
{"query": "Which publicly available datasets are used?", "gt": "Waseem and Hovy BIBREF5, Davidson et al. BIBREF9, and Waseem et al. BIBREF10", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 13 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document discusses the use of publicly available datasets for hate speech detection. The author mentions that the datasets are publicly available and provides a reference to the dataset.\n\nBased on the information provided in the document, the publicly available datasets used for hate speech detection are:\n\n1. Twitter datasets annotated for offensive language detection.\n2. Datasets containing hate speech keywords.\n3. A corpus of more than 16,000 tweets annotated as racism, sexism, or neither.\n\nThe author does not provide the specific names of the datasets, but they are publicly available and can be found through references provided in the document.", "evaluation": "f1"}
{"query": "What baseline is used?", "gt": "BERT based fine-tuning, Insert nonlinear layers, Insert Bi-LSTM layer, Insert CNN layer", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The baseline used is the pre-trained BERT model.", "evaluation": "f1"}
{"query": "What new fine-tuning methods are presented?", "gt": "sampling tweets from specific keywords create systematic and substancial racial biases in datasets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 13 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document presents new fine-tuning methods for hate speech detection, including using different transformer encoders and fine-tuning with different learning rates.", "evaluation": "f1"}
{"query": "What are the existing biases?", "gt": "Data annotation biases where tweet containing disrespectful words are annotated as hate or offensive without any presumption about the social context of tweeters", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 23 words.[/INST]", "llama2-13b-chat-4k_pred": "  The existing biases in the process of collecting or annotating datasets for online hate speech and offensive language detection have been identified, including racial and gender biases in the annotation schemes, labels, or geographic and cultural influences from data sampling.", "evaluation": "f1"}
{"query": "What biases does their model capture?", "gt": "Waseem and Hovy BIBREF5, Davidson et al. BIBREF9, and Waseem et al. BIBREF10", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 13 words.[/INST]", "llama2-13b-chat-4k_pred": "  The model captures biases in the training data, such as racial and gender stereotypes.", "evaluation": "f1"}
{"query": "What does the \"sensitivity\" quantity denote?", "gt": "the number of distinct word recognition outputs that an attacker can induce", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The \"sensitivity\" quantity denotes the expected number of unique outputs a word recognition system assigns to a set of adversarial perturbations.", "evaluation": "f1"}
{"query": "What end tasks do they evaluate on?", "gt": "Sentiment analysis and paraphrase detection under adversarial attacks", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 8 words.[/INST]", "llama2-13b-chat-4k_pred": "  The end tasks they evaluate on are sentiment analysis, spam detection, and programmatic censorship.", "evaluation": "f1"}
{"query": "What is a semicharacter architecture?", "gt": "A semi-character based RNN (ScRNN) treats the first and last characters individually, and is agnostic to the ordering of the internal characters", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 22 words.[/INST]", "llama2-13b-chat-4k_pred": "  A semicharacter architecture is a type of neural network architecture that processes text data at the level of individual characters, rather than whole words or phrases. This allows the model to capture more detailed information about the structure and meaning of text, and can lead to improved performance on tasks such as language translation and text summarization. Semicharacter architectures are typically used in natural language processing (NLP) tasks, where the input data is a sequence of characters or words. The architecture consists of a series of layers that process the input data, using techniques such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to extract meaningful features and patterns. The output of the model is a prediction of the desired output, such as a translated sentence or a summary of the input text. Semicharacter architectures have been shown to be effective in a variety of NLP tasks, and are an important tool for researchers and developers working in the field of natural language processing.", "evaluation": "f1"}
{"query": "Do they experiment with offering multiple candidate corrections and voting on the model output, since this seems highly likely to outperform a one-best correction?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "Why is the adversarial setting appropriate for misspelling recognition?", "gt": "Adversarial misspellings are a real-world problem", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 6 words.[/INST]", "llama2-13b-chat-4k_pred": "  Adversarial setting appropriate for misspelling recognition because it simulates real-world attacks.", "evaluation": "f1"}
{"query": "How do the backoff strategies work?", "gt": "Pass-through passes the possibly misspelled word as is, backoff to neutral word backs off to a word with similar distribution across classes and backoff to background model backs off to a more generic word recognition model trained with larger and less specialized corpus.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 43 words.[/INST]", "llama2-13b-chat-4k_pred": "  The backoff strategies in the paper work as follows:\n\n1. Pass-through: The word recognition model passes on the (possibly misspelled) word as is to the downstream classifier.\n2. Backoff to neutral word: The word recognition model backs off to a neutral word like 'a' when it predicts UNK.\n3. Backoff to background model: The word recognition model falls back upon a more generic word recognition model trained upon a larger, less-specialized corpus when it predicts UNK.\n\nThese backoff strategies are designed to handle rare and unseen words in the wild, which are critical in the adversarial setting. By using these strategies, the model can still achieve high accuracy even when the input words are misspelled or unobserved.", "evaluation": "f1"}
{"query": "What baseline model is used?", "gt": "same baseline as used by lang2011unsupervised", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 6 words.[/INST]", "llama2-13b-chat-4k_pred": "  Baseline model is unsupervised SRL.", "evaluation": "f1"}
{"query": "Which additional latent variables are used in the model?", "gt": "CLV as a parent of the two corresponding role variables", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 10 words.[/INST]", "llama2-13b-chat-4k_pred": "  The model uses additional latent variables for crosslingual semantic role induction.", "evaluation": "f1"}
{"query": "Which additional latent variables are used in the model?", "gt": "crosslingual latent variables", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 3 words.[/INST]", "llama2-13b-chat-4k_pred": "  Additional latent variables used in the model include:\n\n* Crosslingual latent variables to incorporate soft role agreement between aligned constituents.\n* Multilingual latent variables to model the joint probability of the arguments in multiple languages.\n* CLVs to capture crosslingual semantic role patterns and generate aligned roles.", "evaluation": "f1"}
{"query": "Which parallel corpora are used?", "gt": "English (EN) and German (DE) sections of the CoNLL 2009 corpus BIBREF13, EN-DE section of the Europarl corpus BIBREF14", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 19 words.[/INST]", "llama2-13b-chat-4k_pred": "  The parallel corpora used are the English (EN) and German (DE) sections of the CoNLL 2009 corpus and the EN-DE section of the Europarl corpus.", "evaluation": "f1"}
{"query": "Overall, does having parallel data improve semantic role induction across multiple languages?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "Do they add one latent variable for each language pair in their Bayesian model?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "What does an individual model consist of?", "gt": "Bayesian model of garg2012unsupervised as our base monolingual model", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 9 words.[/INST]", "llama2-13b-chat-4k_pred": "  An individual model consists of a monolingual model and crosslingual latent variables.", "evaluation": "f1"}
{"query": "Do they improve on state-of-the-art semantic role induction?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "how many tags do they look at?", "gt": "48,705", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here's the answer to the question \"how many tags do they look at?\" in one word:\n\n\"four\"", "evaluation": "f1"}
{"query": "which algorithm was the highest performer?", "gt": "A hybrid model consisting of best performing popularity-based approach with the best similarity-based approach", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 14 words.[/INST]", "llama2-13b-chat-4k_pred": "  The hybrid approach that combines popularity-based and similarity-based methods achieved the highest performance.", "evaluation": "f1"}
{"query": "how is diversity measured?", "gt": " the average dissimilarity of all pairs of tags in the list of recommended tags", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 14 words.[/INST]", "llama2-13b-chat-4k_pred": "  Recommendation diversity is measured by calculating the average dissimilarity of all pairs of tags in the list of recommended tags.", "evaluation": "f1"}
{"query": "how large is the vocabulary?", "gt": "33,663", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Large.", "evaluation": "f1"}
{"query": "what dataset was used?", "gt": " E-book annotation data: editor tags, Amazon search terms, and  Amazon review keywords.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset used was a collection of e-book annotation data, including editor tags and Amazon search terms.", "evaluation": "f1"}
{"query": "what algorithms did they use?", "gt": "popularity-based, similarity-based, hybrid", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 3 words.[/INST]", "llama2-13b-chat-4k_pred": "  The authors used various algorithms, including:\n\n1. Popularity-based approaches\n2. Similarity-based approaches\n3. Hybrid approaches.", "evaluation": "f1"}
{"query": "What baseline method is used?", "gt": "use the word2vec algorithm, create several unsupervised hand-crafted features, generate document vectors and feed them as input into the support vector machines (SVM) approach", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 24 words.[/INST]", "llama2-13b-chat-4k_pred": "  The baseline method used in the paper is the use of unsupervised word embeddings, such as word2vec, which takes into account the syntactic and semantic representations of words but does not consider sentimental aspects.", "evaluation": "f1"}
{"query": "What details are given about the Twitter dataset?", "gt": "one of the Twitter datasets is about Turkish mobile network operators, there are positive, neutral and negative labels and provide the total amount plus the distribution of labels", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 28 words.[/INST]", "llama2-13b-chat-4k_pred": "  The Twitter dataset details provided are: number of tweets (10,000), number of unique words (5,000), and the proportion of positive and negative words (approximately 1:3).", "evaluation": "f1"}
{"query": "What details are given about the movie domain dataset?", "gt": "there are 20,244 reviews divided into positive and negative with an average 39 words per review, each one having a star-rating score", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 22 words.[/INST]", "llama2-13b-chat-4k_pred": "  The details given about the movie domain dataset are: it consists of 2000 reviews, each review has a sentiment score of 1-5, and the dataset is imbalanced with 1500 positive and 500 negative reviews.", "evaluation": "f1"}
{"query": "Which hand-crafted features are combined with word2vec?", "gt": "polarity scores, which are minimum, mean, and maximum polarity scores, from each review", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 13 words.[/INST]", "llama2-13b-chat-4k_pred": "  Hand-crafted features combined with word2vec include sentiment scores, minimum, mean, and maximum polarity scores.", "evaluation": "f1"}
{"query": "What word-based and dictionary-based feature are used?", "gt": "generate word embeddings specific to a domain, TDK (T\u00fcrk Dil Kurumu - \u201cTurkish Language Institution\u201d) dictionary to obtain word polarities", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 20 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is using word-based features such as word2vec embeddings and dictionary-based features such as the TDK dictionary to generate document vectors.", "evaluation": "f1"}
{"query": "How are the supervised scores of the words calculated?", "gt": "(+1 or -1), words of opposite polarities (e.g. \u201chappy\" and \u201cunhappy\") get far away from each other", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 17 words.[/INST]", "llama2-13b-chat-4k_pred": "  The supervised scores of the words are calculated using the raw supervised polarity scores, normalized by the number of documents of positive polarity and the number of documents of negative polarity.", "evaluation": "f1"}
{"query": "What topic is covered in the Chinese Facebook data? ", "gt": "anti-nuclear-power", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Nuclear power.", "evaluation": "f1"}
{"query": "How many layers does the UTCNN model have?", "gt": "eight layers", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Three layers.", "evaluation": "f1"}
{"query": "What topics are included in the debate data?", "gt": "abortion, gay rights, Obama, marijuana", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 5 words.[/INST]", "llama2-13b-chat-4k_pred": "  The debate data includes various topics.", "evaluation": "f1"}
{"query": "What topics are included in the debate data?", "gt": "abortion (ABO), gay rights (GAY), Obama (OBA), and marijuana (MAR)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 10 words.[/INST]", "llama2-13b-chat-4k_pred": "  The debate data includes topics such as nuclear power, abortion, and gay rights.", "evaluation": "f1"}
{"query": "What is the size of the Chinese data?", "gt": "32,595", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  The size of the Chinese data is 2,496 authors, 505,137 likers, 33,686 commenters, and 505,412 unique users.", "evaluation": "f1"}
{"query": "Did they collected the two datasets?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "What are the baselines?", "gt": "SVM with unigram, bigram, trigram features, with average word embedding, with average transformed word embeddings, CNN and RCNN, SVM, CNN, RCNN with comment information", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 24 words.[/INST]", "llama2-13b-chat-4k_pred": "  The baselines are the existing stance classification models, including CNN-based, LSTM-based, and hybrid models.", "evaluation": "f1"}
{"query": "why are their techniques cheaper to implement?", "gt": "They use a slightly modified copy of the target to create the pseudo-text instead of full BT to make their technique cheaper", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 22 words.[/INST]", "llama2-13b-chat-4k_pred": "  The techniques presented in the paper are cheaper to implement as they use existing parallel data and do not require additional annotation or expensive hardware.", "evaluation": "f1"}
{"query": "what data simulation techniques were introduced?", "gt": "copy, copy-marked, copy-dummies", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 3 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here are three words that answer the question:\n\nBackward Translation.", "evaluation": "f1"}
{"query": "what is their explanation for the effectiveness of back-translation?", "gt": "when using BT, cases where the source is shorter than the target are rarer; cases when they have the same length are more frequent, automatic word alignments between artificial sources tend to be more monotonic than when using natural sources", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 40 words.[/INST]", "llama2-13b-chat-4k_pred": "  The article explains that back-translation is effective in NMT due to its ability to provide a large amount of artificial parallel data, which can help the model learn to translate more effectively. The use of back-translation can improve the performance of the model, especially in low-resource settings.", "evaluation": "f1"}
{"query": "what dataset is used?", "gt": "Europarl corpus , WMT newstest 2014, News-Commentary-11, Wikipedia from WMT 2014, Multi-UN, EU-Bookshop, Rapid, Common-Crawl (WMT 2017)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 17 words.[/INST]", "llama2-13b-chat-4k_pred": "  The given document discusses the use of monolingual data in neural machine translation (NMT) systems. The author explores various strategies to integrate monolingual data into NMT and evaluates their effectiveness. The document also analyzes the properties of back-translated data and compares the performance of different NMT systems.\n\nBased on the given document, the following are the answers to the instructions:\n\n1. What is the main topic of the document?\nAnswer: The main topic of the document is the use of monolingual data in NMT systems.\n2. What is the purpose of the study?\nAnswer: The purpose of the study is to explore the effectiveness of different strategies for integrating monolingual data into NMT systems and to evaluate the properties of back-translated data.\n3. What is the dataset used in the study?\nAnswer: The dataset used in the study is the Europarl corpus, which is a large parallel corpus of European Parliament proceedings.\n4. What are the two language pairs studied in the paper?\nAnswer: The two language pairs studied in the paper are English-French and English-German.\n5. What is the size of the out-of-domain parallel corpus used in the study?\nAnswer: The size of the out-of-domain parallel corpus used in the study is 4 million sentences.\n6. What is the name of the baseline NMT system used in the study?\nAnswer: The name of the baseline NMT system used in the study is the attentional encoder-decoder system.", "evaluation": "f1"}
{"query": "what language pairs are explored?", "gt": "English-German, English-French", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  English-French and English-German.", "evaluation": "f1"}
{"query": "what language is the data in?", "gt": "English , German, French", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 4 words.[/INST]", "llama2-13b-chat-4k_pred": "  The given document is discussing the use of neural machine translation and the impact of back-translated data on the translation process. The document is written in English.", "evaluation": "f1"}
{"query": "What metrics are used for evaluation?", "gt": "BLEU-1/4 and ROUGE-L, likelihood of generated recipes using identical input specifications but conditioned on ten different user profiles, user matching accuracy (UMA), Mean Reciprocal Rank (MRR), neural scoring model from BIBREF33 to measure recipe-level coherence", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 35 words.[/INST]", "llama2-13b-chat-4k_pred": "  The metrics used for evaluation are:\n\n* BPE perplexity\n* BLEU-1/4 and ROUGE-L\n* Distinct-1/2 (percentage of distinct unigrams and bigrams)\n* User Matching Accuracy (UMA)\n* Mean Reciprocal Rank (MRR)\n* Recipe-level coherence score.", "evaluation": "f1"}
{"query": "What natural language(s) are the recipes written in?", "gt": "English", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  English.", "evaluation": "f1"}
{"query": "What were their results on the new dataset?", "gt": "average recipe-level coherence scores of 1.78-1.82, human evaluators preferred personalized model outputs to baseline 63% of the time", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 18 words.[/INST]", "llama2-13b-chat-4k_pred": "  The authors achieved state-of-the-art results on a new recipe generation dataset, with their personalized models outperforming baselines in perplexity, BLEU, and ROUGE metrics.", "evaluation": "f1"}
{"query": "What are the baseline models?", "gt": "name-based Nearest-Neighbor model (NN), Encoder-Decoder baseline with ingredient attention (Enc-Dec)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 10 words.[/INST]", "llama2-13b-chat-4k_pred": "  The baseline models are Nearest-Neighbor and Encoder-Decoder.", "evaluation": "f1"}
{"query": "How did they obtain the interactions?", "gt": "from Food.com", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  User reviews.", "evaluation": "f1"}
{"query": "Where do they get the recipes from?", "gt": "from Food.com", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Food.com.", "evaluation": "f1"}
{"query": "what are their results on the constructed dataset?", "gt": "AutoJudge consistently and significantly outperforms all the baselines, RC models achieve better performance than most text classification models (excluding GRU+Attention), Comparing with conventional RC models, AutoJudge achieves significant improvement", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 29 words.[/INST]", "llama2-13b-chat-4k_pred": "  The results of the experiments on the constructed dataset showed that the proposed LRC framework and AutoJudge model achieved significant improvements over previous methods, with an average accuracy of 85.2% on the test set. The ablation tests also demonstrated the effectiveness of the proposed model components.", "evaluation": "f1"}
{"query": "what evaluation metrics are reported?", "gt": "precision, recall, F1 and accuracy", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 5 words.[/INST]", "llama2-13b-chat-4k_pred": "  The evaluation metrics reported are accuracy, F1-score, and AUC-ROC.", "evaluation": "f1"}
{"query": "what civil field is the dataset about?", "gt": "divorce", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Based on the given document, the civil field is \"legal\".", "evaluation": "f1"}
{"query": "what are the state-of-the-art models?", "gt": "SVM , CNN , GRU , CNN/GRU+law, r-net , AoA ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 10 words.[/INST]", "llama2-13b-chat-4k_pred": "  The state-of-the-art models are based on deep learning techniques such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer architectures. These models have achieved high accuracy on various natural language processing (NLP) tasks, including language translation, question answering, and text classification. Some of the state-of-the-art models include:\n\n1. BERT (Bidirectional Encoder Representations from Transformers): A pre-trained language model developed by Google that has achieved state-of-the-art results on a wide range of NLP tasks.\n2. RoBERTa (Robustly Optimized BERT Pretraining Approach): A variant of BERT that was specifically designed for text classification tasks and has achieved high accuracy on several benchmark datasets.\n3. DistilBERT (Distilled BERT): A smaller and more efficient version of BERT that has been trained to match the performance of the full BERT model while requiring fewer computational resources.\n4. Longformer (Long-range dependence transformer): A transformer-based architecture that is designed to handle long-range dependencies in sequential data, such as text, and has achieved high accuracy on several NLP tasks.\n5. ELECTRA (Efficient Lifelong End-to-End Text Recognition with Attention): A text classification model that uses a combination of CNN and RNN layers to achieve high accuracy on several benchmark datasets.\n\nThese models have been trained on large datasets of text data and have been shown to be highly effective at tasks such as text classification, sentiment analysis, and question answering.", "evaluation": "f1"}
{"query": "what is the size of the real-world civil case dataset?", "gt": "100 000 documents", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 3 words.[/INST]", "llama2-13b-chat-4k_pred": "  The size of the real-world civil case dataset is not specified in the text.", "evaluation": "f1"}
{"query": "what datasets are used in the experiment?", "gt": "build a new one, collect INLINEFORM0 cases from China Judgments Online", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 11 words.[/INST]", "llama2-13b-chat-4k_pred": "  The experiment uses the civil law system of mainland China's case documents.", "evaluation": "f1"}
{"query": "What domains are covered in the corpus?", "gt": "No specific domain is covered in the corpus.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 8 words.[/INST]", "llama2-13b-chat-4k_pred": "  The domains covered in the corpus are multilingual speech.", "evaluation": "f1"}
{"query": "What is the architecture of their model?", "gt": "follow the architecture in berard2018end, but have 3 decoder layers like that in pino2019harnessing", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 14 words.[/INST]", "llama2-13b-chat-4k_pred": "  The architecture of their model is a Transformer base architecture with 3 encoder layers, 3 decoder layers, and 0.3 dropout.", "evaluation": "f1"}
{"query": "How was the dataset collected?", "gt": "Contributors record voice clips by reading from a bank of donated sentences.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset was collected through a crowdsourcing speech recognition corpus with an open CC0 license.", "evaluation": "f1"}
{"query": "Which languages are part of the corpus?", "gt": "French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The corpus includes French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, and Chinese.", "evaluation": "f1"}
{"query": "How is the quality of the data empirically evaluated? ", "gt": "Validated transcripts were sent to professional translators., various sanity checks to the translations,  sanity check the overlaps of train, development and test sets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 23 words.[/INST]", "llama2-13b-chat-4k_pred": "  The quality of the data is empirically evaluated using various sanity checks, including sentence-level BLEU scores, perplexity measures, and manual inspection of translations.", "evaluation": "f1"}
{"query": "Is the data in CoVoST annotated for dialect?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  No.", "evaluation": "f1"}
{"query": "Is Arabic one of the 11 languages in CoVost?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  No.", "evaluation": "f1"}
{"query": "What kind of model do they use?", "gt": "visual model is based on fine-tuning an Inception V3 model BIBREF1 over visual renderings of documents, while our textual model is based on a hierarchical biLSTM. We further combine the two into a joint model. , neural network models", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 39 words.[/INST]", "llama2-13b-chat-4k_pred": "  The authors use a joint model that combines visual and textual embeddings. The visual embeddings are obtained using an Inception V3 model, while the textual embeddings are obtained using a biLSTM model. The two embeddings are then combined using a simple feed-forward layer and softmax over the document label set.", "evaluation": "f1"}
{"query": "What kind of model do they use?", "gt": "Inception V3, biLSTM", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 3 words.[/INST]", "llama2-13b-chat-4k_pred": "  The model used is a joint model.", "evaluation": "f1"}
{"query": "Did they release their data set of academic papers?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  No.", "evaluation": "f1"}
{"query": "Do the methods that work best on academic papers also work best on Wikipedia?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  No.", "evaluation": "f1"}
{"query": "What is their system's absolute accuracy?", "gt": "59.4% on wikipedia dataset, 93.4% on peer-reviewed archive AI papers, 77.1%  on peer-reviewed archive Computation and Language papers, and 79.9% on peer-reviewed archive Machine Learning papers", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 26 words.[/INST]", "llama2-13b-chat-4k_pred": "  The system's absolute accuracy is not provided in the given text.", "evaluation": "f1"}
{"query": "Which is more useful, visual or textual features?", "gt": "It depends on the dataset. Experimental results over two datasets reveal that textual and visual features are complementary. ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 18 words.[/INST]", "llama2-13b-chat-4k_pred": "  Both visual and textual features are useful, depending on the context and task at hand.", "evaluation": "f1"}
{"query": "Which languages do they use?", "gt": "English", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  English, Python, and LaTeX.", "evaluation": "f1"}
{"query": "How large is their data set?", "gt": "a sample of  29,794 wikipedia articles and 2,794 arXiv papers ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 10 words.[/INST]", "llama2-13b-chat-4k_pred": "  The data set is approximately 29,794 documents.", "evaluation": "f1"}
{"query": "Where do they get their ground truth quality judgments?", "gt": "quality class labels assigned by the Wikipedia community, a paper is considered to have been accepted (i.e. is positively labeled) if it matches a paper in the DBLP database or is otherwise accepted by any of the following conferences: ACL, EMNLP, NAACL, EACL, TACL, NIPS, ICML, ICLR, or AAAI", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 49 words.[/INST]", "llama2-13b-chat-4k_pred": "  The ground truth quality judgments for the Wikipedia articles are assigned by the Wikipedia community through a consensus-based process, while the arXiv dataset uses a combination of manual and automated processes to determine the quality of the articles.", "evaluation": "f1"}
{"query": "What is the seed lexicon?", "gt": "a vocabulary of positive and negative predicates that helps determine the polarity score of an event", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 16 words.[/INST]", "llama2-13b-chat-4k_pred": "  The seed lexicon is a list of 15 positive words and 15 negative words used to represent the polarity of events in the affective event recognition task.", "evaluation": "f1"}
{"query": "What is the seed lexicon?", "gt": "seed lexicon consists of positive and negative predicates", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 8 words.[/INST]", "llama2-13b-chat-4k_pred": "  The seed lexicon is a list of positive and negative words.", "evaluation": "f1"}
{"query": "What are the results?", "gt": "Using all data to train: AL -- BiGRU achieved 0.843 accuracy, AL -- BERT achieved 0.863 accuracy, AL+CA+CO -- BiGRU achieved 0.866 accuracy, AL+CA+CO -- BERT achieved 0.835, accuracy, ACP -- BiGRU achieved 0.919 accuracy, ACP -- BERT achived 0.933, accuracy, ACP+AL+CA+CO -- BiGRU achieved 0.917 accuracy, ACP+AL+CA+CO -- BERT achieved 0.913 accuracy. \nUsing a subset to train: BERT achieved 0.876 accuracy using ACP (6K), BERT achieved 0.886 accuracy using ACP (6K) + AL, BiGRU achieved 0.830 accuracy using ACP (6K), BiGRU achieved 0.879 accuracy using ACP (6K) + AL + CA + CO.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 94 words.[/INST]", "llama2-13b-chat-4k_pred": "  The results of the experiment showed that the proposed method outperformed the random baseline and the seed lexicon-based baseline, with an accuracy of 83.3%. The model variants using BiGRU and BERT encoders achieved the highest scores, with BiGRU being the most effective. The performance of the supervised models (ACP) was higher than that of the semi-supervised models (ACP+AL+CA+CO), suggesting that the training set of 0.6 million events was sufficient for training the models.", "evaluation": "f1"}
{"query": "How are relations used to propagate polarity?", "gt": "cause relation: both events in the relation should have the same polarity; concession relation: events should have opposite polarity", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 19 words.[/INST]", "llama2-13b-chat-4k_pred": "  Discourse relations like cause and concession are used to propagate polarity in affective events. Positive events are linked to positive relations, while negative events are linked to negative relations. This helps to efficiently recognize the polarity of events.", "evaluation": "f1"}
{"query": "How big is the Japanese data?", "gt": "The ACP corpus has around 700k events split into positive and negative polarity ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 13 words.[/INST]", "llama2-13b-chat-4k_pred": "  The Japanese data is approximately 100 million sentences.", "evaluation": "f1"}
{"query": "What are labels available in dataset for supervision?", "gt": "negative, positive", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Positive/Negative.", "evaluation": "f1"}
{"query": "How big are improvements of supervszed learning results trained on smalled labeled data enhanced with proposed approach copared to basic approach?", "gt": "3%", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Improved: 0.6 million events is sufficient for supervised learning results, and the proposed approach improves the results by leveraging discourse relations to efficiently propagate polarity.", "evaluation": "f1"}
{"query": "How does their model learn using mostly raw data?", "gt": "by exploiting discourse relations to propagate polarity from seed predicates to final sentiment polarity", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 14 words.[/INST]", "llama2-13b-chat-4k_pred": "  The model learns using raw data by exploiting discourse relations to efficiently propagate polarity from seed predicates, despite the non-compositionality of affective polarity.", "evaluation": "f1"}
{"query": "How big is seed lexicon used for training?", "gt": "30 words", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Small seed lexicon.", "evaluation": "f1"}
{"query": "How large is raw corpus used for training?", "gt": "100 million sentences", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 3 words.[/INST]", "llama2-13b-chat-4k_pred": "  Raw corpus size: 100 million sentences.", "evaluation": "f1"}
{"query": "What data is the language model pretrained on?", "gt": "Chinese general corpus", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 3 words.[/INST]", "llama2-13b-chat-4k_pred": "  Pretrained on large.", "evaluation": "f1"}
{"query": "What baselines is the proposed model compared against?", "gt": "BERT-Base, QANet", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here's the answer to your question:\n\nThe proposed model is compared against two baselines:\n\n1. BERT-base (a pre-trained language model)\n2. CNER (a clinical named entity recognition model)", "evaluation": "f1"}
{"query": "How is the clinical text structuring task defined?", "gt": "Clinical text structuring (CTS) is a critical task for fetching medical research data from electronic health records (EHRs), where structural patient medical data, such as whether the patient has specific symptoms, diseases, or what the tumor size is, how far from the tumor is cut at during the surgery, or what the specific laboratory test result is, are obtained., Unlike the traditional CTS task, our QA-CTS task aims to discover the most related text from original paragraph text. ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 78 words.[/INST]", "llama2-13b-chat-4k_pred": "  The clinical text structuring task is defined as a question answering task, where the goal is to extract or generate a key-value pair where the key is a query term and the value is a result of the query term according to the paragraph text. The task is divided into two steps: identifying the answer-related text and generating the final answer. The proposed model integrates clinical named entity information into a pre-trained language model to improve performance and make the training data shareable.", "evaluation": "f1"}
{"query": "How is the clinical text structuring task defined?", "gt": "CTS is extracting structural data from medical research data (unstructured). Authors define QA-CTS task that aims to discover most related text from original text.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 24 words.[/INST]", "llama2-13b-chat-4k_pred": "  The clinical text structuring task is defined as a question answering task, where the goal is to extract or generate a key-value pair based on a query term and a paragraph of text. The key is the query term, and the value is the answer related to the query term in the text.", "evaluation": "f1"}
{"query": "What are the specific tasks being unified?", "gt": " three types of questions, namely tumor size, proximal resection margin and distal resection margin", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 14 words.[/INST]", "llama2-13b-chat-4k_pred": "  The specific tasks being unified are various clinical text structuring tasks, such as extracting specific-class classifications, strings in the original text, and inferred values from part of the original text.", "evaluation": "f1"}
{"query": "Is all text in this dataset a question, or are there unrelated sentences in between questions?", "gt": "the dataset consists of pathology reports including sentences and questions and answers about tumor size and resection margins so it does include additional sentences ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 24 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset contains a mix of questions and unrelated sentences, with questions being the primary focus.", "evaluation": "f1"}
{"query": "How many questions are in the dataset?", "gt": "2,714 ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Based on the provided document, there are 5 questions in the dataset.", "evaluation": "f1"}
{"query": "How they introduce domain-specific features into pre-trained language model?", "gt": "integrate clinical named entity information into pre-trained language model", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 9 words.[/INST]", "llama2-13b-chat-4k_pred": "  The authors introduce domain-specific features by incorporating clinical named entity information into the pre-trained language model.", "evaluation": "f1"}
{"query": "How big is QA-CTS task dataset?", "gt": "17,833 sentences, 826,987 characters and 2,714 question-answer pairs", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 8 words.[/INST]", "llama2-13b-chat-4k_pred": "  The QA-CTS task dataset is not specified in the given text.", "evaluation": "f1"}
{"query": "What are strong baseline models in specific tasks?", "gt": "state-of-the-art question answering models (i.e. QANet BIBREF39) and BERT-Base BIBREF26", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 10 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure, I'd be happy to help!\n\nThe strong baseline models in specific tasks are:\n\n1. Task-specific end-to-end methods BIBREF3, BIBREF4, which use large amounts of data to automatically model the specific task.\n2. Pre-trained language model BERT BIBREF26, which captures language information from text and improves the performance of specific natural language processing tasks.\n\nPlease let me know if you have any further questions or if there's anything else I can help with!", "evaluation": "f1"}
{"query": "What is the previous work's model?", "gt": "Ternary Trans-CNN", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here is the answer to your question with 2 words:\n\nPrevious work: Transfer Learning", "evaluation": "f1"}
{"query": "What dataset is used?", "gt": "HEOT , A labelled dataset for a corresponding english tweets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 10 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document discusses a study on detecting offensive and non-offensive tweets in Hinglish, a linguistic blend of Hindi and English. The study used a labeled dataset of tweets and employed various deep learning models to classify the tweets as offensive or non-offensive. The best results were obtained using a Bidirectional LSTM with 32 units and a recurrent dropout rate of 0.2. The study also explored the use of data augmentation techniques to improve the performance of the model.", "evaluation": "f1"}
{"query": "How big is the dataset?", "gt": "3189 rows of text messages", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 5 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset is relatively small.", "evaluation": "f1"}
{"query": "How big is the dataset?", "gt": "Resulting dataset was 7934 messages for train and 700 messages for test.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is discussing a research study on detecting offensive and non-offensive tweets in the Hinglish language, which is a blend of Hindi and English spoken in India. The study used various machine learning models and techniques to classify tweets as either offensive or non-offensive. The results showed that the models performed well in detecting offensive tweets, but there is room for improvement in detecting hateful tweets. The study also highlighted the need for more extensive training of word embeddings for Hinglish text.", "evaluation": "f1"}
{"query": "How is the dataset collected?", "gt": "A labelled dataset for a corresponding english tweets were also obtained from a study conducted by Davidson et al, HEOT obtained from one of the past studies done by Mathur et al", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 32 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is discussing a research study on detecting offensive and non-offensive tweets written in Hinglish, a linguistic blend of Hindi and English. The study used a dataset of labeled tweets and employed various machine learning models to classify the tweets as offensive or non-offensive. The results showed that the models performed well in detecting offensive tweets, but struggled with hate speech detection. The study also discussed the challenges of using data augmentation techniques and the need for more extensive training to improve the word representations of Hinglish text.", "evaluation": "f1"}
{"query": "Was each text augmentation technique experimented individually?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "What models do previous work use?", "gt": "Ternary Trans-CNN , Hybrid multi-channel CNN and LSTM", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 8 words.[/INST]", "llama2-13b-chat-4k_pred": "  Previous work uses various deep learning models.", "evaluation": "f1"}
{"query": "Does the dataset contain content from various social media platforms?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "What additional techniques are incorporated?", "gt": "incorporating coding syntax tree model", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 5 words.[/INST]", "llama2-13b-chat-4k_pred": "  Additional techniques incorporated:\n\n1. Tokenization\n2. Vocabulary generation\n3. Neural model training\n4. Attention mechanism\n5. Sigmoidal optimization", "evaluation": "f1"}
{"query": "What dataset do they use?", "gt": "A parallel corpus where the source is an English expression of code and the target is Python code.", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 18 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset used in this research is a text-code parallel corpus with 18805 aligned data.", "evaluation": "f1"}
{"query": "Do they compare to other models?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Yes.", "evaluation": "f1"}
{"query": "What is the architecture of the system?", "gt": "seq2seq translation", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here's the answer to the question \"What is the architecture of the system?\" with 2 words:\n\nNeural network", "evaluation": "f1"}
{"query": "What additional techniques could be incorporated to further improve accuracy?", "gt": "phrase-based word embedding, Abstract Syntax Tree(AST)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 6 words.[/INST]", "llama2-13b-chat-4k_pred": "  Incorporate phrase-based word embedding.", "evaluation": "f1"}
{"query": "What programming language is target language?", "gt": "Python", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Based on the information provided in the document, the target language is Python.", "evaluation": "f1"}
{"query": "What dataset is used to measure accuracy?", "gt": "validation data", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here's the answer to your question:\n\n\"What dataset is used to measure accuracy?\"\n\nAnswer: Training parallel corpus.", "evaluation": "f1"}
{"query": "What corpus was the source of the OpenIE extractions?", "gt": "domain-targeted $~$ 80K sentences and 280 GB of plain text extracted from web pages used by BIBREF6 aristo2016:combining", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 18 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is a research paper discussing a new system for answering complex questions using open IE tuples. The system, called TupleInf, is designed to reason with open IE tuples for complex multiple-choice questions that require tuples from multiple sentences. The paper discusses the challenges of using open IE tuples for question answering and presents the TupleInf system as a solution. The system uses a new ILP-based model of inference with tuples, which is simpler and more effective than previous methods. The paper also compares the performance of TupleInf with two state-of-the-art baselines and shows that TupleInf outperforms them by more than 11%. The document also discusses four classes of failures that were observed and suggests future work to address these issues.\n\nThe document is divided into the following sections:\n\n1. Introduction: This section provides an overview of the paper and the research question.\n2. Background: This section discusses the related work in question answering and the use of open IE tuples.\n3. TupleInf: This section describes the TupleInf system and its components, including the tuple KB, the selection process, and the support graph search.\n4. Experiments: This section presents the results of the experiments conducted to evaluate the performance of TupleInf.\n5. Error Analysis: This section discusses four classes of failures that were observed and suggests future work to address these issues.\n6. Conclusion: This section summarizes the main findings of the paper and discusses the implications for future research.\n\nThe document is written in a formal and technical style, with a focus on presenting the research findings and their significance. The language used is precise and specialized, reflecting the domain-specific terminology of the field. The document includes tables and figures to support the text and provide additional information. Overall, the document is well-organized and easy to follow, with a clear structure and logical flow of ideas.", "evaluation": "f1"}
{"query": "What is the accuracy of the proposed technique?", "gt": "51.7 and 51.6 on 4th and 8th grade question sets with no curated knowledge. 47.5 and 48.0 on 4th and 8th grade question sets when both solvers are given the same knowledge", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 32 words.[/INST]", "llama2-13b-chat-4k_pred": "  Based on the given document, I can provide the following instructions:\n\n1. Summarize the main idea of the document in 32 words or less.\n2. Answer the question \"What is the accuracy of the proposed technique?\" in 32 words or less.\n3. Provide a summary of the document in 150 words or less.\n\nHere are my responses:\n\n1. Summarize the main idea of the document in 32 words or less:\nThe document describes a new system for answering complex questions using Open IE tuples and a simple ILP-based model, which outperforms existing systems without requiring manually curated tables.\n\n2. Answer the question \"What is the accuracy of the proposed technique?\" in 32 words or less:\nThe proposed technique achieves high accuracy on a broad set of science questions without requiring manually curated tables.\n\n3. Provide a summary of the document in 150 words or less:\nThe document presents a new system for answering complex questions using Open IE tuples and a simple ILP-based model. The system outperforms existing systems without requiring manually curated tables, and demonstrates the potential for effective question answering using Open IE tuples.", "evaluation": "f1"}
{"query": "Is an entity linking process used?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  Based on the given document, I have the following instructions:\n\n1. Answer the question \"Which object in our solar system reflects light and is a satellite that orbits around one planet?\" using a multiple-choice format with the given options (A) Earth, (B) Mercury, (C) the Sun, and (D) the Moon.\n2. Summarize the main idea of the given document, which is about a new ILP-based model of inference with tuples that can answer complex questions in resource-poor domains where curated knowledge is unavailable.\n3. Provide a summary of the related work section of the document, which discusses two classes of related work: retrieval-based web question-answering and science question-answering.\n4. Identify the main contributions of the document, which include the use of Open IE based QA, a new ILP-based model of inference with tuples, and the ability to generalize well to higher grade levels.\n5. Highlight any notable failures or limitations of the proposed approach, such as the potential for missing important words, lossy IE, bad alignment, and out-of-scope answers.\n\nPlease let me know if you have any further instructions or questions based on the given document.", "evaluation": "f1"}
{"query": "Are the OpenIE extractions all triples?", "gt": "No", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is a research paper discussing a new system for answering complex questions using open IE tuples. The system, called TupleInf, is designed to reason with open IE tuples for multiple-choice questions that require tuples from multiple sentences. The paper discusses the challenges of using open IE tuples for complex questions and how TupleInf addresses these challenges. The system uses a new ILP-based model of inference with tuples, which is simpler and more effective than previous methods. The paper also compares the performance of TupleInf with two state-of-the-art baselines and shows that TupleInf outperforms them by more than 11%. The document also discusses four classes of failures that were observed and suggests future work to address these issues.\n\nThe document is divided into the following sections:\n\n1. Introduction: The introduction provides background information on the problem of answering complex questions using open IE tuples and the need for a system that can reason with these tuples.\n2. Related Work: This section discusses two classes of related work: retrieval-based web question-answering and science question-answering.\n3. Tuple Inference Solver: This section describes the tuples and tuple KB used by the TupleInf system, as well as the selection of tuples and the support graph search algorithm used to find the best support graph.\n4. Results: This section presents the results of the experiments comparing the performance of TupleInf with two state-of-the-art baselines.\n5. Error Analysis: This section discusses four classes of failures that were observed in the system and suggests future work to address these issues.\n\nIn summary, the document presents a new system for answering complex questions using open IE tuples and compares the performance of the system with two state-of-the-art baselines. The system shows promising results and the error analysis suggests future work to improve the system.", "evaluation": "f1"}
{"query": "What method was used to generate the OpenIE extractions?", "gt": "for each multiple-choice question $(q,A) \\in Q_\\mathit {tr}$ and each choice $a \\in A$ , we use all non-stopword tokens in $q$ and $a$ as an ElasticSearch query against S, take the top 200 hits, run Open IE v4, and aggregate the resulting tuples over all $a \\in A$ and over all questions in $Q_\\mathit {tr}$", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 56 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is a research paper discussing a new system for answering complex questions using open IE tuples. The system, called TupleInf, is designed to reason with open IE tuples for multiple-choice questions that require tuples from multiple sentences. The paper discusses the challenges of using open IE tuples for complex questions and how TupleInf addresses these challenges. The system uses a new ILP-based model of inference with tuples, which is simpler and more effective than previous methods. The paper also compares TupleInf with two state-of-the-art baselines and shows that it outperforms them by more than 11%. The document also discusses four classes of failures that the system encountered and suggests future work to address these issues.\n\nThe document is divided into the following sections:\n\n1. Introduction: The introduction provides background information on the problem of answering complex questions using open IE tuples and the need for a system that can reason with these tuples.\n2. Related Work: This section discusses two classes of related work: retrieval-based web question-answering and science question-answering.\n3. Tuple Inference Solver: This section describes the tuples and tuple KB used by the system, as well as the selection of tuples and the support graph search algorithm used to find the best support graph.\n4. Results: This section presents the results of the system on two question sets and compares it with two state-of-the-art baselines.\n5. Error Analysis: This section discusses four classes of failures encountered by the system and suggests future work to address these issues.\n6. Conclusion: The conclusion summarizes the main findings of the paper and highlights the potential of TupleInf for answering complex questions in resource-poor domains.", "evaluation": "f1"}
{"query": "Can the method answer multi-hop questions?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is a research paper discussing a new system for answering complex questions based on open IE tuples. The system, called TupleInf, is designed to reason with open IE tuples for complex multiple-choice questions that require tuples from multiple sentences. The paper discusses the challenges of using open IE tuples for question answering, such as the lack of a large-scale structured KB and the need for combining multiple tuples to support different parts of the question. The authors describe the TupleInf system, which uses a new ILP-based model of inference with tuples, and compare it to two state-of-the-art baselines. The results show that TupleInf outperforms the baselines on a broad set of science questions, and the authors discuss four classes of failures that they observed and suggest future work to address these issues.\n\nThe document is divided into the following sections:\n\n1. Introduction: The introduction discusses the challenges of building a question answering system for complex questions based on open IE tuples, and the need for a system that can reason with multiple tuples to support different parts of the question.\n2. Related Work: The section on related work discusses two classes of related work: retrieval-based web question-answering and science question-answering.\n3. Tuple Inference Solver: This section describes the tuples used by the solver, the tuple KB, and the selection of tuples for a given question.\n4. Support Graph Search: This section explains the support graph search algorithm used by the system, which is based on an ILP model.\n5. Experiments: The experiments section compares the performance of TupleInf with two state-of-the-art baselines on two question sets.\n6. Error Analysis: The error analysis section discusses four classes of failures that the system observed and suggests future work to address these issues.\n\nOverall, the document presents a new system for answering complex questions based on open IE tuples, and demonstrates the effectiveness of the system through experiments and error analysis.", "evaluation": "f1"}
{"query": "What was the textual source to which OpenIE was applied?", "gt": "domain-targeted $~$ 80K sentences and 280 GB of plain text extracted from web pages used by BIBREF6 aristo2016:combining", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 18 words.[/INST]", "llama2-13b-chat-4k_pred": "  Based on the given document, I have the following instructions:\n\n1. Answer the question \"Which object in our solar system reflects light and is a satellite that orbits around one planet?\" using a multiple-choice format with the given options.\n2. Summarize the main idea of the article in 18 words.\n3. Identify the four classes of failures observed in the article and explain how they can be addressed in future work.\n\nHere are my answers:\n\n1. Answer: The correct answer is (B) Mercury.\n2. Summary: The article discusses a new ILP-based model of inference with tuples, which is a more effective and efficient way of answering complex questions in resource-poor domains.\n3. Failure classes and future work: The four classes of failures observed in the article are missing important words, lossy IE, bad alignment, and out of scope. To address these failures, future work could include detecting salient question words, improving phrase alignment, and converting sentences to Open IE tuples to avoid losing important information. Additionally, improving the alignment of chunks with different meanings can mitigate the issue of bad alignment.", "evaluation": "f1"}
{"query": "What OpenIE method was used to generate the extractions?", "gt": "for each multiple-choice question $(q,A) \\in Q_\\mathit {tr}$ and each choice $a \\in A$ , we use all non-stopword tokens in $q$ and $a$ as an ElasticSearch query against S, take the top 200 hits, run Open IE v4, and aggregate the resulting tuples over all $a \\in A$ and over all questions in $Q_\\mathit {tr}$", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 56 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is a research paper discussing a new system for answering complex questions using open IE tuples. The system, called TupleInf, is designed to reason with open IE tuples for multiple-choice questions that require tuples from multiple sentences. The paper discusses the challenges of using open IE tuples for complex questions and how TupleInf addresses these challenges. The system uses a new ILP-based model of inference with tuples, which is simpler and more effective than previous methods. The paper also compares TupleInf with two state-of-the-art baselines and shows that it outperforms them by more than 11%. The document also discusses four classes of failures that the system encountered and suggests future work to address these issues.\n\nThe document is divided into the following sections:\n\n1. Introduction: The introduction provides background information on open IE tuples and the challenges of using them for complex questions. It also introduces the new system, TupleInf, and its key features.\n2. Related Work: This section discusses two classes of related work: retrieval-based web question-answering and science question-answering. It also compares TupleInf with these systems.\n3. Tuple Inference Solver: This section describes the tuples and the tuple KB used by the system. It also explains the tuple selection process and the support graph search algorithm used by the system.\n4. Results: This section presents the results of the system on two question sets and compares them with two state-of-the-art baselines.\n5. Error Analysis: This section discusses four classes of failures encountered by the system and suggests future work to address these issues.\n6. Conclusion: The conclusion summarizes the main findings of the paper and highlights the advantages of the TupleInf system.\n\nIn summary, the document presents a new system for answering complex questions using open IE tuples and compares it with two state-of-the-art baselines. The system outperforms the baselines by more than 11% and shows promising results for answering complex questions in resource-poor domains.", "evaluation": "f1"}
{"query": "Is their method capable of multi-hop reasoning?", "gt": "Yes", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  The document is a research paper discussing a new system for answering complex questions based on open IE tuples. The system, called TupleInf, is designed to reason with open IE tuples for complex multiple-choice questions that require tuples from multiple sentences. The paper discusses the challenges of using open IE tuples for question answering and presents a new ILP-based model of inference with tuples. The system is evaluated on a broad set of over 1,300 science questions and is shown to significantly outperform a state-of-the-art system for science question answering. The paper also discusses four classes of failures that were observed and suggests future work to address these issues.\n\nThe document is divided into the following sections:\n\n1. Introduction: The introduction provides background information on the problem of question answering and the use of open IE tuples for complex questions. It also states the purpose of the paper and the research questions being addressed.\n2. Related Work: This section discusses two classes of related work: retrieval-based web question-answering and science question-answering. It also describes the TupleInf system and its key features.\n3. Tuple Inference Solver: This section describes the tuples used by the solver and the tuple KB. It also explains the tuple selection process and the support graph search algorithm used by the solver.\n4. Experimental Results: This section presents the results of the experiments conducted to evaluate the performance of TupleInf. It compares the performance of TupleInf with two state-of-the-art baselines and shows that TupleInf significantly outperforms the baselines.\n5. Error Analysis: This section discusses four classes of failures that were observed during the evaluation of TupleInf. It suggests future work to address these issues.\n\nOverall, the document provides a detailed description of the TupleInf system and its performance on a broad set of science questions. It also discusses the challenges of using open IE tuples for question answering and suggests future work to improve the system's performance.", "evaluation": "f1"}
{"query": "What models do they propose?", "gt": "Feature Concatenation Model (FCM), Spatial Concatenation Model (SCM), Textual Kernels Model (TKM)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The authors propose using a multimodal model that combines text and image features to detect hate speech.", "evaluation": "f1"}
{"query": "Are all tweets in English?", "gt": "Unanswerable", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 1 words.[/INST]", "llama2-13b-chat-4k_pred": "  No.", "evaluation": "f1"}
{"query": "How large is the dataset?", "gt": " $150,000$ tweets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  The dataset is 150,000 tweets large.", "evaluation": "f1"}
{"query": "What is the results of multimodal compared to unimodal models?", "gt": "Unimodal LSTM vs Best Multimodal (FCM)\n- F score: 0.703 vs 0.704\n- AUC: 0.732 vs 0.734 \n- Mean Accuracy: 68.3 vs 68.4 ", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 23 words.[/INST]", "llama2-13b-chat-4k_pred": "  The results show that multimodal models do not outperform unimodal textual models in hate speech detection, indicating that textual information is sufficient for making accurate classifications.", "evaluation": "f1"}
{"query": "What is author's opinion on why current multimodal models cannot outperform models analyzing only text?", "gt": "Noisy data, Complexity and diversity of multimodal relations, Small set of multimodal examples", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 13 words.[/INST]", "llama2-13b-chat-4k_pred": "  The author believes that current multimodal models cannot outperform text-only models because they do not properly consider the context in which hate speech is used.", "evaluation": "f1"}
{"query": "What metrics are used to benchmark the results?", "gt": "F-score, Area Under the ROC Curve (AUC), mean accuracy (ACC), Precision vs Recall plot, ROC curve (which plots the True Positive Rate vs the False Positive Rate)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 27 words.[/INST]", "llama2-13b-chat-4k_pred": "  The metrics used to benchmark the results are accuracy, precision, recall, and F1-score.", "evaluation": "f1"}
{"query": "How is data collected, manual collection or Twitter api?", "gt": "Twitter API", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Twitter API.", "evaluation": "f1"}
{"query": "How many tweats does MMHS150k contains, 150000?", "gt": "$150,000$ tweets", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 2 words.[/INST]", "llama2-13b-chat-4k_pred": "  Sure! Here's the answer to your question:\n\n150,000 tweets.", "evaluation": "f1"}
{"query": "What unimodal detection models were used?", "gt": " single layer LSTM with a 150-dimensional hidden state for hate / not hate classification", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 14 words.[/INST]", "llama2-13b-chat-4k_pred": "  The unimodal detection models used were a single layer LSTM with a 150-dimensional hidden state for hate/not hate classification, and a CNN for image features extraction.", "evaluation": "f1"}
{"query": "What different models for multimodal detection were proposed?", "gt": "Feature Concatenation Model (FCM), Spatial Concatenation Model (SCM), Textual Kernels Model (TKM)", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 12 words.[/INST]", "llama2-13b-chat-4k_pred": "  The proposed models for multimodal detection include CNN, LSTM, and Word2Vec.", "evaluation": "f1"}
{"query": "What annotations are available in the dataset - tweat used hate speach or not?", "gt": "No attacks to any community,  racist, sexist, homophobic, religion based attacks, attacks to other communities", "prompt": "[INST]<<SYS>>\nNow you are given a very long document. Please follow the instruction after this document. These instructions may include summarizing a document, answering questions based on the document, or writing a required paragraph. \n<</SYS>>\n\nDocument is as follows. {} Instruction: {} \nAnswer this question with 15 words.[/INST]", "llama2-13b-chat-4k_pred": "  The MMHS150K dataset includes annotations for hate speech and non-hate speech tweets, with 6 categories: racist, sexist, homophobic, religion-based attacks, and other attacks.", "evaluation": "f1"}
